1 Introduction

Housing prices are an important indicator of the strength of the economy. House price prediction can help real estate developers set the selling price of a house, help buyers make informed choices about potential purchases, and help property investors identify price trends across locations. A simple predictive and inferential method for modelling housing prices can therefore be of great value to the financial market; however, predicting long-term housing prices remains a complex and challenging task. This paper describes our project on determining how different factors affect home sale prices by building linear models. The data used in this project were collected in Melbourne, Australia in 2017. Melbourne is a large metropolitan city with a strong real estate market, in a region of Australia that experienced a 4.2 percent growth rate in property sales in 2017. We believe the factors that determine housing prices in our model could apply broadly to other locations and countries.

Our project sought to answer the following questions:

  • Can housing prices in Melbourne, Australia be predicted using this dataset?
  • Which variables have the greatest impact on housing price?
  • How do the location, seller, and construction attributes of homes affect the housing market in Melbourne, Australia?

1.1 The Melbourne Housing Snapshot Dataset

The independent variables mainly describe each house along three dimensions: (a) type; (b) quality or grade; and (c) quantity or area. Before the exploratory data analysis, we summarize the variables in the Melbourne housing data:

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

Rooms: Number of rooms

Price: Sale price in Australian dollars (AUD)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

Bedroom2: Number of bedrooms

Bathroom: Number of bathrooms

Car: Number of car spaces

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS coordinates (column names spelled as in the dataset)

Suburb: Suburb name - 314 categories

2 Exploratory Data Analysis (EDA)

The following are excerpts and graphs from our exploratory data analysis. This part of the project familiarizes the reader with the attributes of our dataset and lays the foundation for the variables we include in our linear model. The results of our EDA also inform the future direction of the project.

2.1 Summary of Price Statistics

Summary of data.full$Price (AUD):

Min 85000
Q1 650000
Median 903000
Mean 1075684
Q3 1330000
Max 9000000
SD 639310.724
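As a minimal sketch of how such a summary can be computed, the Python snippet below uses only the standard library. The function name `price_summary` and the toy price list are hypothetical; the `inclusive` quantile method is chosen on the assumption that the original summary used linearly interpolated quartiles.

```python
import statistics

def price_summary(prices):
    """Return min, quartiles, mean, max, and sample SD of a list of prices."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    return {
        "Min": min(prices),
        "Q1": q1,
        "Median": median,
        "Mean": statistics.mean(prices),
        "Q3": q3,
        "Max": max(prices),
        "SD": statistics.stdev(prices),  # sample standard deviation (n - 1)
    }

# Hypothetical toy data, not the Melbourne dataset.
toy_prices = [85000, 650000, 903000, 1330000, 9000000]
summary = price_summary(toy_prices)

# A mean well above the median is a quick sign of right skew.
is_right_skewed = summary["Mean"] > summary["Median"]
```

In the reported summary the mean (1075684) also exceeds the median (903000), which is consistent with the right skew discussed in Section 2.5.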

2.2 Select Data Pairs

These scatter plots of selected numerical variables show a few potential outliers in home size, land size, and selling price, which make it difficult to discern patterns in some of the pairings. Nonetheless, there appears to be some linear correlation, which we explore further.

2.3 Correlations

Because the dataset has a large number of feature columns, it is difficult to grasp all the pairwise linear correlations at once. Therefore, before further feature mining, we first examine which variables are highly correlated with house price. The heatmap below is interactive when viewed in HTML. Hover over a cell to view the correlation coefficient and the p-value for that variable pair.

The direction of most of these correlations is unsurprising. One would expect the number of rooms to be positively correlated with both the number of bedrooms and price. Conversely, distance from the central business district is slightly negatively correlated with price. There are also strong correlations among some predictors, such as Rooms and Bedrooms, Rooms and Bathrooms, and Bedrooms and Bathrooms, which raises a concern about multicollinearity. These findings guide our feature selection for linear regression and are borne out in the significance and variance inflation of the attempted models.
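Each cell of such a heatmap is a Pearson correlation coefficient. The sketch below computes one in plain Python; the function name `pearson_r` and the small `rooms` and `bedrooms` lists are hypothetical illustrations, not values from the dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values for two strongly related predictors.
rooms = [2, 3, 3, 4, 5]
bedrooms = [2, 3, 3, 4, 4]
r_rooms_bedrooms = pearson_r(rooms, bedrooms)
```

A coefficient close to 1 between two predictors, as in this toy case, is the kind of pairing that motivates the multicollinearity checks in Section 3.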

2.4 Map of Melbourne Sales

We carried out a visual analysis of the distribution of housing sales in Melbourne. The result is shown in the figure below.

The figure shows that sales are mainly concentrated in the Eastern Metropolitan, Southern Metropolitan, Northern Metropolitan, Western Metropolitan, and South-Eastern Metropolitan regions. Fluctuations in housing prices will therefore greatly affect these areas, which account for about 5/6 of Melbourne.

2.5 Selling Price

Selling price is the variable that we would like to predict and draw inferences about through linear modelling, so we explore its distribution further.

The histogram, box plot, and Q-Q plot show that selling price is right-skewed rather than normal. This makes sense: no sale was below $85,000, and there is a hard lower bound at $0, whereas prices have no theoretical upper limit. That pattern is clearly displayed here.

The log-transformed price is approximately normally distributed: the histogram, box plot, and Q-Q plot all show only very slight deviation from normal. Hence, selling price is approximately log-normal.
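The effect of the log transform on skew can be illustrated with simulated data. The sketch below draws hypothetical log-normal "prices" (the parameters 13.7 and 0.5 are arbitrary choices, not estimates from the Melbourne data) and compares the sample skewness before and after taking logs.

```python
import math
import random

def sample_skewness(values):
    """Moment-based sample skewness (third central moment / variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Simulated, hypothetical prices; a fixed seed keeps the run reproducible.
rng = random.Random(0)
simulated_prices = [rng.lognormvariate(13.7, 0.5) for _ in range(5000)]

raw_skew = sample_skewness(simulated_prices)
log_skew = sample_skewness([math.log(p) for p in simulated_prices])
```

For log-normal data, `raw_skew` is clearly positive while `log_skew` is close to zero, mirroring the contrast between the raw and log-transformed plots described above.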

2.5.1 Selling Price’s Relationship to Select Categorical Variables

We suspect that location, seller, and type of home are associated with sale price. Furthermore, the heatmap above shows that price is correlated with the number of rooms, which can be treated as categorical: the more rooms there are, the higher the price.

2.5.2 Price by Region

The box plot suggests a dependence between region and price. Note that the data still appear non-normal and plausibly log-normal. Although the number of observations in each region is sufficiently large, group sizes vary greatly: Western Victoria has only 32 sales, whereas Southern Metropolitan has 4695. Because normality is not satisfied and the group sizes are uneven, we cannot rely on the robustness of ANOVA to test for independence.

2.5.3 Price by Number of Rooms (<10 Rooms)

Levels with more than 9 rooms are omitted because they contain too few observations. Again, the data appear plausibly log-normal. The box plot suggests a dependence, especially for homes with fewer than six rooms; it is less pronounced as the number of rooms increases. Again, the level sizes are highly uneven, so we cannot apply ANOVA to test for dependence in this case either.

2.5.4 Price by Type of Home

The pattern of non-normal data and uneven level sizes repeats when price is conditioned on type of home (house, unit, townhouse). So, again, ANOVA is not appropriate.

2.5.5 Test of Independence by Group (Pearson \(\chi^2\))

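Variables tested: Type, Rooms, Regionname, SellerG

Because the ANOVA assumptions were not met, we cut price into 5 uneven levels of at least 8 observations each and performed Pearson's chi-squared test of independence against each of the four categorical variables.

\(H_0\): Binned price is independent of the grouping variable

All four tests reject \(H_0\) with p-value \(<2\times 10^{-16}\)

The null hypothesis of a Pearson chi-squared test is that the binned price is independent of the categorical variable, that is, the distribution of price levels is the same in every group (a statement about distributions rather than about means). As expected, we rejected the null hypothesis for all four variables.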
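The Pearson statistic behind such a test can be sketched in a few lines of Python. The function name `chi_squared_statistic` and the 3-by-5 table of counts (three home types against five price levels) are hypothetical; the p-value itself would come from a chi-squared distribution with the returned degrees of freedom, which is not computed here.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are home types, columns are 5 price levels.
toy_counts = [
    [5, 10, 20, 30, 35],   # house
    [40, 30, 15, 10, 5],   # unit
    [15, 25, 30, 20, 10],  # townhouse
]
toy_statistic, toy_dof = chi_squared_statistic(toy_counts)
```

A large statistic relative to its degrees of freedom indicates that the price levels are distributed differently across groups, which is the sense in which the tests above reject independence.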

2.6 Price by Region and Type

The figure below shows price by region and type of home.

2.7 Transform Data - Homogeneity

This is a residual analysis of the linear model fitted to log-transformed price. Strictly, this diagnostic belongs with the modelling in Section 3. We ultimately discarded this model and relied on the robustness of linear regression, because the pattern in the residuals of the proposed model was visible but not pronounced.

2.8 Transform Data - Normality

This is a Q-Q plot used to check the normality of the residuals from 3.15. It is a slight improvement over our proposed model but, in our view, not enough to justify the log transform.

3 Linear Modelling

3.1 First Attempt at Linear Model

In this first model, we specify the regression as: Price ~ Rooms + Landsize + Distance + Bedroom2 + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname). To check the model for multicollinearity, we compute the variance inflation factor (VIF) of each term. The VIF plot shows that Bedroom2 has the highest VIF, so it is not suitable for the price regression. The second model therefore drops Bedroom2.
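For a predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below covers only the special case of two predictors, where R_j^2 reduces to the squared Pearson correlation; the function names and the toy `rooms` and `bedrooms` values are hypothetical, and a full model with many terms would need a proper regression routine.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two:
    R_j^2 equals r^2, so VIF = 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical toy values, not taken from the dataset.
rooms = [2, 3, 3, 4, 5]
bedrooms = [2, 3, 3, 4, 4]
toy_vif = vif_two_predictors(rooms, bedrooms)
```

Here the two toy predictors are so strongly correlated that the VIF is well above 5, the cut-off mentioned later in this section, which is the situation that would argue for dropping one of them.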

3.2 Linear Model 2: Removed the Variable with Highest VIF

In this model, we specify the regression as: Price ~ Rooms + Landsize + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname). As with Model 1, the VIF results are shown below:

3.3 Model Coefficients

In this model, factor(Regionname) Western Metropolitan is the term with the highest VIF. The model coefficients are as follows:

Landsize has the highest p-value, which exceeds 0.05, so this variable should be dropped.

3.4 Linear Model 3: Considered Interactions

In this model, we analyze the interaction between rooms and region. The results are as follows:

The p-values for the model with the interaction indicate that all of the terms are statistically significant.
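One way to write a rooms-by-region interaction model is sketched below. The symbols \(\beta\), \(\gamma\), and \(\delta\) are illustrative notation introduced here, and because the full predictor set of Model 3 is not listed, \(x_{ik}\) stands for whichever numeric predictors (including Rooms) were retained; \(r\) ranges over the non-baseline regions.

\[ \text{Price}_i = \beta_0 + \sum_{k} \beta_k x_{ik} + \sum_{r} \gamma_r \, \mathbb{1}[\text{Region}_i = r] + \sum_{r} \delta_r \, \text{Rooms}_i \, \mathbb{1}[\text{Region}_i = r] + \varepsilon_i \]

Under this form, the effect of one additional room in region \(r\) is \(\beta_{\text{Rooms}} + \delta_r\), so the interaction lets the price of an extra room differ by region.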

3.5 Linear Model 4: Removed Land Size

Based on the results of Model 2, Landsize is dropped. This model is specified as: Price ~ Rooms + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor(Regionname).

The VIF values for Model 4 suggest that factor(Regionname) Western Metropolitan is a candidate for removal: it is the only term whose VIF exceeds 5.

3.6 Model 4 Coefficients

Residual Analysis

3.7 Homogeneity? No

3.8 Normal? Not Quite

3.9 Influence? Yes

3.10 Remove Influence Points

We checked homogeneity of variance and normality of the residuals for Model 4. The results show that it is necessary to remove the influential points.

4 Proposed Model

The final proposed model and its coefficients are as follows:
Observations 4955 (4548 missing obs. deleted)
Dependent variable Price
Type OLS linear regression
F(15,4939) 425.77
R² 0.56
Adj. R² 0.56
Term Est. S.E. t val. p
(Intercept) -129445309.64 17287919.00 -7.49 0.00
Rooms 255713.04 9273.23 27.58 0.00
Distance -44501.15 1467.41 -30.33 0.00
Bathroom 115532.73 11453.40 10.09 0.00
Car 45801.33 7532.08 6.08 0.00
BuildingArea 1794.92 90.67 19.80 0.00
Lattitude -757249.47 124821.36 -6.07 0.00
Longtitude 696894.60 116542.04 5.98 0.00
Propertycount -3.69 1.52 -2.42 0.02
factor(Regionname)Eastern Victoria 188304.28 103229.34 1.82 0.07
factor(Regionname)Northern Metropolitan -55889.99 30219.79 -1.85 0.06
factor(Regionname)Northern Victoria 598550.10 116506.82 5.14 0.00
factor(Regionname)South-Eastern Metropolitan 169831.25 51506.18 3.30 0.00
factor(Regionname)Southern Metropolitan 212777.72 27309.69 7.79 0.00
factor(Regionname)Western Metropolitan -86943.29 38731.01 -2.24 0.02
factor(Regionname)Western Victoria 515064.40 135131.01 3.81 0.00
Standard errors: OLS
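To show how these coefficients combine into a fitted price, the sketch below evaluates the linear predictor for one home. The coefficients are copied from the table above; the function name `predict_price`, the example home, and its units are hypothetical, and Eastern Metropolitan is assumed to be the baseline region because it is the only region without its own row.

```python
# Coefficients copied from the proposed-model table.
INTERCEPT = -129445309.64
NUMERIC_COEFFICIENTS = {
    "Rooms": 255713.04,
    "Distance": -44501.15,
    "Bathroom": 115532.73,
    "Car": 45801.33,
    "BuildingArea": 1794.92,
    "Lattitude": -757249.47,
    "Longtitude": 696894.60,
    "Propertycount": -3.69,
}
REGION_COEFFICIENTS = {
    "Eastern Victoria": 188304.28,
    "Northern Metropolitan": -55889.99,
    "Northern Victoria": 598550.10,
    "South-Eastern Metropolitan": 169831.25,
    "Southern Metropolitan": 212777.72,
    "Western Metropolitan": -86943.29,
    "Western Victoria": 515064.40,
}
# Assumption: the region missing from the table is the reference level.
BASELINE_REGION = "Eastern Metropolitan"

def predict_price(home, region):
    """Fitted price (AUD) for a dict of numeric predictors and a region name."""
    if region != BASELINE_REGION and region not in REGION_COEFFICIENTS:
        raise ValueError(f"unknown region: {region}")
    price = INTERCEPT
    for name, coefficient in NUMERIC_COEFFICIENTS.items():
        price += coefficient * home[name]
    # The baseline region contributes no additional term.
    price += REGION_COEFFICIENTS.get(region, 0.0)
    return price

# Hypothetical home; values and units are illustrative only.
example_home = {
    "Rooms": 3, "Distance": 10.0, "Bathroom": 2, "Car": 1,
    "BuildingArea": 150.0, "Lattitude": -37.8, "Longtitude": 145.0,
    "Propertycount": 7000,
}
fitted = predict_price(example_home, "Southern Metropolitan")
fitted_extra_room = predict_price({**example_home, "Rooms": 4},
                                  "Southern Metropolitan")
```

Holding everything else fixed, the difference `fitted_extra_room - fitted` equals the Rooms coefficient, which is how each estimate in the table should be read.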

4.1 Testing \(R^2\)

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]
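The formula translates directly into code. The sketch below assumes, based on the section title, that this value is computed on observations held out from fitting; the function name `r_squared` and the toy numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - RSS / TSS for observed values and model predictions."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared prediction errors.
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared deviations from the observed mean.
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - rss / tss

# Hypothetical held-out prices and predictions (in AUD).
toy_actual = [600000.0, 900000.0, 1500000.0]
toy_predicted = [700000.0, 850000.0, 1300000.0]
toy_r2 = r_squared(toy_actual, toy_predicted)
```

On held-out data, R² can fall below the in-sample value and can even be negative when predictions are worse than simply using the mean.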

5 Conclusion

The preceding analysis did not treat outliers systematically, and handling them may further improve the results. This project let us practise linear regression on a real dataset, and the final model explains a moderate share of the variation in price (adjusted R² of 0.56). Future work could further explore the log transformation, consider a GLM with a log link, address factors with hundreds of levels, handle missing data, and improve prediction.

5.1 Future Work

  • Further explore log transformation
  • Consider GLM with log link
  • Handle factors with many levels (hundreds)
  • Handle missing data
  • Improve prediction

6 Bibliography

Dataset available: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

Thorne, S. (2019, November 3). How the Australian Property Market Performed in 2017. Retrieved from www.openagent.com.au/blog/how-the-australian-property-market-performed-in-2017#.

Mansfield, E. R., & Helms, B. P. (1982). Detecting multicollinearity. The American Statistician, 36(3a), 158-160.

Daoud, J. I. (2017, December). Multicollinearity and regression analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing.

Brownie, C., & Boos, D. D. (1994). Type I error robustness of ANOVA and ANOVA on ranks when the number of treatments is large. Biometrics, 50(2), 542.

Lix, L. M., et al. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance ‘F’ test. Review of Educational Research, 66(4), 579.